Authorship Attribution Using a Neural Network Language Model
Authors
Abstract
In practice, training language models for individual authors is often expensive because of limited data resources. In such cases, Neural Network Language Models (NNLMs) generally outperform traditional non-parametric N-gram models. Here we investigate the performance of a feed-forward NNLM on an authorship attribution problem with a moderate author set size and relatively limited data. We also consider how text topics affect performance. Compared with a well-constructed N-gram baseline with Kneser-Ney smoothing, the proposed method achieves nearly a 2.5% reduction in perplexity and increases author classification accuracy by 3.43% on average, given as few as 5 test sentences. The performance is very competitive with the state of the art in terms of accuracy and demand on test data. The source code, preprocessed datasets, and a detailed description of the methodology and results are available at https://github.com/zge/authorship-attribution.

Introduction

Authorship attribution refers to identifying authors from given texts by their unique textual features. It is challenging because an author's style may vary over time with topic, mood, and environment. Many methods have been explored to address this problem, such as Latent Dirichlet Allocation for topic modeling (Seroussi, Zukerman, and Bohnert 2011) and Naive Bayes for text classification (Coyotl-Morales et al. 2006). Among language modeling methods, there is mixed advocacy for conventional N-gram methods (Kešelj et al. 2003) versus methods using more compact, distributed representations, such as Neural Network Language Models (NNLMs), which have been claimed to capture semantics better with limited training data (Bengio et al. 2003). Most available NNLM toolkits (Mikolov et al. 2010) are designed for recurrent NNLMs, which are better at capturing complex, longer-range text patterns but require more training data.
In contrast, the feed-forward NNLM framework we propose is less computationally expensive and more suitable for language modeling with limited data. It is developed in MATLAB with full network-tuning functionality. The database we use is composed of transcripts of 16 video courses from Coursera, collected one sentence per line into a text file for each course. To reduce the influence of topic on author/instructor classification, all courses were selected from science and engineering fields, such as Algorithms, DSP, Data Mining, IT, Machine Learning, NLP, etc. There are 8000+ sentences per course and about 20 words per sentence on average. The vocabulary size of each author varies from 3000 to 9000. After stemming with Porter's algorithm and pruning words with relative frequency below 1/100,000, author vocabulary size is reduced to a range from 1800 to 2700, with an average around 2000. Fig. 1 shows the vocabulary size for each course under various conditions, and the database coverage with the most frequent k words (k = 500, 1000, 2000) after stemming and pruning.

Copyright © 2015, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
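The feed-forward architecture described above follows the Bengio-style design: each context word is mapped to an embedding, the embeddings are concatenated, passed through a tanh hidden layer, and a softmax output gives the next-word distribution, from which per-word perplexity can be computed. The following is a minimal sketch of that forward pass and of the perplexity metric; all sizes, names, and the numpy implementation are illustrative assumptions, not the authors' MATLAB code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (not the paper's actual configuration):
# V = vocab size, d = embedding dim, n = context length, h = hidden units.
V, d, n, h = 2000, 50, 3, 100

# Parameters of a Bengio-style feed-forward NNLM (untrained, small random init).
C = rng.normal(0.0, 0.01, (V, d))       # word embedding table
H = rng.normal(0.0, 0.01, (n * d, h))   # hidden layer weights
b = np.zeros(h)                         # hidden bias
U = rng.normal(0.0, 0.01, (h, V))       # output layer weights
c = np.zeros(V)                         # output bias

def predict(context_ids):
    """P(w | context) for a context of n word ids."""
    x = C[context_ids].reshape(-1)      # concatenate the n embeddings
    a = np.tanh(x @ H + b)              # hidden layer
    z = a @ U + c                       # output logits
    z -= z.max()                        # softmax with numerical stability
    p = np.exp(z)
    return p / p.sum()

def perplexity(token_ids):
    """Per-word perplexity of a token sequence under the model."""
    logp, count = 0.0, 0
    for t in range(n, len(token_ids)):
        p = predict(token_ids[t - n:t])
        logp += np.log(p[token_ids[t]])
        count += 1
    return float(np.exp(-logp / count))

tokens = rng.integers(0, V, size=50)
ppl = perplexity(tokens)  # untrained model: nearly uniform, so close to V
```

A lower perplexity on held-out text from a candidate author indicates a better model fit, which is the basis for both the perplexity comparison against the Kneser-Ney baseline and the author classification decision.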
Similar resources
Domain Specific Author Attribution based on Feedforward Neural Network Language Models
Authorship attribution refers to the task of automatically determining the author based on a given sample of text. It is a problem with a long history and a wide range of applications. Building author profiles using language models is one of the most successful methods to automate this task. New language modeling methods based on neural networks alleviate the curse of dimensionality and usua...
Authorship attribution of source code by using back propagation neural network based on particle swarm optimization
Authorship attribution is the task of identifying the most likely author of a given sample among a set of known candidate authors. It can not only be applied to discover the original author of plain text, such as novels, blogs, emails, and posts, but can also be used to identify source-code programmers. Authorship attribution of source code is required in diverse applications, ranging from malicious code trackin...
Continuous N-gram Representations for Authorship Attribution
This paper presents work on using continuous representations for authorship attribution. In contrast to previous work, which uses discrete feature representations, our model learns continuous representations for n-gram features via a neural network jointly with the classification layer. Experimental results demonstrate that the proposed model outperforms the state-of-the-art on two datasets, wh...
Wallace: Author Detection via Recurrent Neural Networks
Author detection or author attribution is an important field in NLP that enables us to verify the authorship of papers or novels and allows us to identify anonymous authors. In our approach to this classic problem, we attempt to classify a broad set of literary works by a large number of distinct authors using traditional and deep-learning techniques, including Multinomial Naive Bayes, linear S...
Convolutional Neural Networks for Authorship Attribution of Short Texts
We present a model to perform authorship attribution of tweets using Convolutional Neural Networks (CNNs) over character n-grams. We also present a strategy that improves model interpretability by estimating the importance of input text fragments in the predicted classification. The experimental evaluation shows that text CNNs perform competitively and are able to outperform previous methods.
Authorship Attribution Using Word Network Features
In this paper, we explore a set of novel features for authorship attribution of documents. These features are derived from a word network representation of natural language text. As has been noted in previous studies, natural language tends to show complex network structure at word level, with low degrees of separation and scale-free (power law) degree distribution. There has also been work on ...